Purpose

A typical R project may include the following steps (depending on the purpose):

Step 1: Loading the libraries needed for the project

Step 2: Loading the dataset

Step 3: Analysing, manipulating and visualising the dataset

Step 4: Fitting models and testing goodness of fit

Step 5: Data output

This workshop presents an example of using R to fit a generalised linear model and aims to help people get started with R.

This document contains the R code used in the example.

Contents

Step 1: Loading R Libraries

Loading the R libraries that are needed for the project

#install and load packages
#install.packages("readxl")
library(readxl)
library(readr)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Step 2: Loading datasets

Datasets can be imported using readxl or readr. In this example, however, we load a dataset from the insuranceData package.

Description of the Singapore Automobile Claims data:

The data is from the General Insurance Association of Singapore, an organization consisting of general (property and casualty) insurers in Singapore (see the organization’s website: www.gia.org.sg). From this database, several characteristics are available to explain automobile accident frequency. These characteristics include vehicle variables, such as type and age, as well as person-level variables, such as age, gender and prior driving experience.

https://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/DataDescriptions.pdf

About the insuranceData package: https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf

#Inputfile<-"D:/YT_R/HandsOn/SingaporeData.csv"   
#InputData<-read_csv(Inputfile)
library(insuranceData)
data(SingaporeAuto)
df <- as.data.frame(SingaporeAuto)

Step 3: Analysing and manipulating the dataset

High-level Summary of the Data

names(df)
##  [1] "SexInsured"  "Female"      "VehicleType" "PC"          "Clm_Count"  
##  [6] "Exp_weights" "LNWEIGHT"    "NCD"         "AgeCat"      "AutoAge0"   
## [11] "AutoAge1"    "AutoAge2"    "AutoAge"     "VAgeCat"     "VAgecat1"
dim(df)
## [1] 7483   15
colnames(df)
##  [1] "SexInsured"  "Female"      "VehicleType" "PC"          "Clm_Count"  
##  [6] "Exp_weights" "LNWEIGHT"    "NCD"         "AgeCat"      "AutoAge0"   
## [11] "AutoAge1"    "AutoAge2"    "AutoAge"     "VAgeCat"     "VAgecat1"
head(df,10)
##    SexInsured Female VehicleType PC Clm_Count Exp_weights    LNWEIGHT NCD
## 1           U      0           T  0         0   0.6680356 -0.40341383  30
## 2           U      0           T  0         0   0.5667351 -0.56786326  30
## 3           U      0           T  0         0   0.5037645 -0.68564629  30
## 4           U      0           T  0         0   0.9144422 -0.08944106  20
## 5           U      0           T  0         0   0.5366188 -0.62246739  20
## 6           U      0           T  0         0   0.7529090 -0.28381095  20
## 7           U      0           M  0         0   0.6707734 -0.39932384  20
## 8           U      0           M  0         0   0.8377823 -0.17699695  20
## 9           U      0           M  0         0   0.1670089 -1.78970819  20
## 10          U      0           M  0         0   0.5037645 -0.68564629  20
##    AgeCat AutoAge0 AutoAge1 AutoAge2 AutoAge VAgeCat VAgecat1
## 1       0        0        0        0       0       0        2
## 2       0        0        0        0       0       0        2
## 3       0        0        0        0       0       0        2
## 4       0        0        0        0       0       0        2
## 5       0        0        0        0       0       0        2
## 6       0        0        0        0       0       0        2
## 7       0        0        0        0       0       6        6
## 8       0        0        0        0       0       6        6
## 9       0        0        0        0       0       6        6
## 10      0        0        0        0       0       6        6
summary(df)
##  SexInsured     Female         VehicleType         PC        
##  F: 700     Min.   :0.00000   A      :3842   Min.   :0.0000  
##  M:3145     1st Qu.:0.00000   G      :2882   1st Qu.:0.0000  
##  U:3638     Median :0.00000   Q      : 358   Median :1.0000  
##             Mean   :0.09355   M      : 188   Mean   :0.5134  
##             3rd Qu.:0.00000   P      :  88   3rd Qu.:1.0000  
##             Max.   :1.00000   Z      :  71   Max.   :1.0000  
##                               (Other):  54                   
##    Clm_Count        Exp_weights          LNWEIGHT            NCD       
##  Min.   :0.00000   Min.   :0.005476   Min.   :-5.2074   Min.   : 0.00  
##  1st Qu.:0.00000   1st Qu.:0.279261   1st Qu.:-1.2756   1st Qu.: 0.00  
##  Median :0.00000   Median :0.503764   Median :-0.6856   Median :20.00  
##  Mean   :0.06989   Mean   :0.519859   Mean   :-0.8945   Mean   :19.85  
##  3rd Qu.:0.00000   3rd Qu.:0.752909   3rd Qu.:-0.2838   3rd Qu.:30.00  
##  Max.   :3.00000   Max.   :1.000000   Max.   : 0.0000   Max.   :50.00  
##                                                                        
##      AgeCat        AutoAge0         AutoAge1          AutoAge2      
##  Min.   :0.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :2.00   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :1.94   Mean   :0.3905   Mean   :0.05867   Mean   :0.05987  
##  3rd Qu.:4.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :7.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                     
##     AutoAge         VAgeCat         VAgecat1    
##  Min.   :0.000   Min.   :0.000   Min.   :2.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:2.000  
##  Median :1.000   Median :1.000   Median :2.000  
##  Mean   :0.509   Mean   :2.019   Mean   :2.933  
##  3rd Qu.:1.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.000   Max.   :6.000   Max.   :6.000  
## 
str(df)
## 'data.frame':    7483 obs. of  15 variables:
##  $ SexInsured : Factor w/ 3 levels "F","M","U": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Female     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VehicleType: Factor w/ 9 levels "A","G","M","P",..: 7 7 7 7 7 7 3 3 3 3 ...
##  $ PC         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Clm_Count  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Exp_weights: num  0.668 0.567 0.504 0.914 0.537 ...
##  $ LNWEIGHT   : num  -0.4034 -0.5679 -0.6856 -0.0894 -0.6225 ...
##  $ NCD        : int  30 30 30 20 20 20 20 20 20 20 ...
##  $ AgeCat     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AutoAge0   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AutoAge1   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AutoAge2   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AutoAge    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VAgeCat    : int  0 0 0 0 0 0 6 6 6 6 ...
##  $ VAgecat1   : int  2 2 2 2 2 2 6 6 6 6 ...
hist(df$Clm_Count)

table(df$Clm_Count)
## 
##    0    1    2    3 
## 6996  455   28    4
levels(as.factor(df$SexInsured))
## [1] "F" "M" "U"
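Since claims are rare, proportions can be easier to read than raw counts; a quick sketch using `prop.table()`:

```r
# share of policies with 0, 1, 2, 3 claims
round(prop.table(table(df$Clm_Count)), 4)
```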

Data manipulation

Data Analysis - understanding the dataset

#Summary of claim count by chosen factors using dplyr 
df %>%
  group_by(VehicleType, SexInsured) %>%
  summarise(number = n(),
            Count = sum(Clm_Count),
            Exposure = sum(Exp_weights),
            Frequency = Count/Exposure)
## # A tibble: 13 x 6
## # Groups:   VehicleType [9]
##    VehicleType SexInsured number Count Exposure Frequency
##    <fct>       <fct>       <int> <int>    <dbl>     <dbl>
##  1 A           F             700    50  362.       0.138 
##  2 A           M            3142   254 1600.       0.159 
##  3 G           M               1     0    0.167    0     
##  4 G           U            2881   169 1520.       0.111 
##  5 M           M               1     0    0.999    0     
##  6 M           U             187     4  105.       0.0383
##  7 P           U              88    11   46.5      0.237 
##  8 Q           U             358    28  193.       0.145 
##  9 S           U              16     1    8.64     0.116 
## 10 T           U               8     0    4.29     0     
## 11 W           M               1     0    0.914    0     
## 12 W           U              29     1   12.6      0.0793
## 13 Z           U              71     5   36.8      0.136
# Exercise : choose other factors and produce a summary table
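As a sketch of one possible answer to the exercise, the same dplyr pattern grouped by the NCD band instead (any other factor, such as AgeCat or VAgeCat, works the same way):

```r
# summarise claim counts and exposure by NCD band
df %>%
  group_by(NCD) %>%
  summarise(number = n(),
            Count = sum(Clm_Count),
            Exposure = sum(Exp_weights),
            Frequency = Count / Exposure)
```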

Data Cleaning

Data Cleaning - generating a cleaned dataset

#Generate a cleaned dataset with factor variables using dplyr 
df_clean <- df %>%
  transmute(SexInsured=factor(SexInsured),
            Female=factor(Female),
            Private=factor(PC),
            NCDCat=factor(NCD),
            AgeCat=factor(AgeCat),
            VAgeCat=factor(VAgeCat),
            Exp_weights,
            Clm_Count,
            Frequency = Clm_Count/Exp_weights
  )


# Exercise : choose other factors to include in the cleaned dataset
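After cleaning, it is worth confirming that each variable now has the intended type; for example:

```r
# all categorical variables should now show as factors
str(df_clean)
```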

Data Visualisation

#Using self-defined functions: 
  
avg <- function(x) {
  dat <- aggregate(df$Clm_Count, by = list(df[, x]), FUN = mean)
  # names.arg places the category labels under the correct bars;
  # an axis() call with at = 1:nrow(dat) would misalign them, since
  # barplot() does not centre bars at integer positions
  barplot(dat$x, names.arg = dat$Group.1, xlab = x, ylab = "Claim Count Averages")
}

par(mfrow=c(2,2))
avg(x = "AgeCat")
avg(x = "VehicleType")
avg(x = "AutoAge1")
avg(x = "VAgeCat")

avg(x = "NCD")
avg(x = "PC")
avg(x = "SexInsured")

#Using plotly: 
  
df_clean$ID <- seq.int(nrow(df_clean))
y <- sort(df_clean$Frequency, decreasing = FALSE)
plot_ly(df_clean, x = df_clean$ID, y = y, name = "sorted frequency", type = 'scatter', mode = 'lines')
#Using ggplot2: 
  


#two-way 
ggplot(df_clean, aes(x = SexInsured, fill = NCDCat)) +
  geom_bar(position = "fill") +
  theme_classic()+theme(axis.text.x = element_text(angle = 90))

# explore frequency by factors
ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count/Exp_weights, fill=SexInsured)) +
  geom_bar(stat="identity", position=position_dodge())

ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count, fill=SexInsured)) +
  geom_bar(stat="identity", position=position_dodge())

ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count)) +
  geom_bar(stat="identity", position=position_dodge())

g <- ggplot(data = df_clean) + scale_y_continuous(labels = scales::percent)
# aes_string() is deprecated; aes() with after_stat(prop) replaces "..prop.."
germ_claims <- g + geom_bar(aes(x = Clm_Count, y = after_stat(prop), weight = Exp_weights/sum(Exp_weights))) + ggtitle("Claim counts") + labs(x = NULL, y = "% exposure")
germ_claims

#Facet the claims bar chart

germ_claims+facet_grid(.~NCDCat)+ggtitle("Claim counts by NCDCat")

germ_claims+facet_grid(.~AgeCat)+ggtitle("Claim counts by AgeCat")

Step 4: Fitting models and testing goodness of fit

a) Split the dataset into training and validation sets

training <- 0.80
set.seed(2024)   # any fixed seed makes the split reproducible
df_training <- sample(1:nrow(df_clean), size = nrow(df_clean) * training, replace = FALSE)
df_train <- df_clean[df_training, ]
df_validate <- df_clean[-df_training, ]

b) Start with an intercept-only model and check its mean against the original data

poisson_intercept <- glm(Clm_Count~1, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_intercept)
## 
## Call:
## glm(formula = Clm_Count ~ 1, family = poisson(link = log), data = df_train, 
##     offset = log(Exp_weights))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5137  -0.4457  -0.3484  -0.2449   3.9932  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.02552    0.04945  -40.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2141.4  on 5985  degrees of freedom
## Residual deviance: 2141.4  on 5985  degrees of freedom
## AIC: 2918.8
## 
## Number of Fisher Scoring iterations: 6
paste("Check average model frequency",exp(poisson_intercept$coefficients[["(Intercept)"]]),"vs data average frequency",sum(df_train$Clm_Count)/sum(df_train$Exp_weights))
## [1] "Check average model frequency 0.131925376321422 vs data average frequency 0.13192537632136"

c) Fit a model with all factors and use the stepwise function to compare models

model_full <-glm(Clm_Count ~ .-Exp_weights-Frequency-ID , data =df_train, family=poisson(link=log),offset=log(Exp_weights))
model_stepwise <- step(poisson_intercept, scope=list(lower=poisson_intercept, upper=model_full), direction = "both")
## Start:  AIC=2918.83
## Clm_Count ~ 1
## 
##              Df Deviance    AIC
## + VAgeCat     6   2068.9 2858.4
## + NCDCat      5   2118.1 2905.6
## + Private     1   2131.3 2910.8
## + SexInsured  2   2130.3 2911.8
## + AgeCat      6   2123.8 2913.3
## <none>            2141.4 2918.8
## + Female      1   2141.4 2920.8
## 
## Step:  AIC=2858.37
## Clm_Count ~ VAgeCat
## 
##              Df Deviance    AIC
## + NCDCat      5   2042.7 2842.2
## <none>            2068.9 2858.4
## + Female      1   2067.3 2858.8
## + Private     1   2068.2 2859.6
## + SexInsured  2   2066.8 2860.3
## + AgeCat      6   2060.0 2861.5
## - VAgeCat     6   2141.4 2918.8
## 
## Step:  AIC=2842.21
## Clm_Count ~ VAgeCat + NCDCat
## 
##              Df Deviance    AIC
## <none>            2042.7 2842.2
## + Female      1   2041.2 2842.7
## + Private     1   2042.7 2844.2
## + SexInsured  2   2041.2 2844.6
## + AgeCat      6   2036.9 2848.4
## - NCDCat      5   2068.9 2858.4
## - VAgeCat     6   2118.1 2905.6

d) Fit models using chosen factors and compare them using anova

formulas_1<-"Clm_Count ~ SexInsured"
formulas_2<-"Clm_Count ~ NCDCat + VAgeCat"
formulas_3<-"Clm_Count ~ SexInsured+NCDCat+AgeCat+VAgeCat"

poisson_reg1 <- glm(formulas_1, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg1)
## 
## Call:
## glm(formula = formulas_1, family = poisson(link = log), data = df_train, 
##     offset = log(Exp_weights))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5606  -0.4318  -0.3434  -0.2368   4.1161  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0397     0.1644 -12.407   <2e-16 ***
## SexInsuredM   0.1892     0.1789   1.058    0.290    
## SexInsuredU  -0.1561     0.1813  -0.861    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2141.4  on 5985  degrees of freedom
## Residual deviance: 2130.3  on 5983  degrees of freedom
## AIC: 2911.8
## 
## Number of Fisher Scoring iterations: 6
poisson_reg2 <- glm(formulas_2, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg2)
## 
## Call:
## glm(formula = formulas_2, family = poisson(link = log), data = df_train, 
##     offset = log(Exp_weights))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7621  -0.4265  -0.3116  -0.2001   3.7668  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.5710     0.1014 -15.500  < 2e-16 ***
## NCDCat10     -0.3733     0.1442  -2.589  0.00961 ** 
## NCDCat20     -0.4350     0.1475  -2.950  0.00318 ** 
## NCDCat30     -0.3535     0.2148  -1.645  0.09992 .  
## NCDCat40     -0.7144     0.2723  -2.623  0.00871 ** 
## NCDCat50     -0.6656     0.1509  -4.411 1.03e-05 ***
## VAgeCat1      0.2209     0.1556   1.420  0.15572    
## VAgeCat2      0.3353     0.1534   2.186  0.02885 *  
## VAgeCat3      0.1108     0.1670   0.664  0.50678    
## VAgeCat4     -0.4739     0.1858  -2.551  0.01075 *  
## VAgeCat5     -1.2482     0.2326  -5.366 8.07e-08 ***
## VAgeCat6     -1.5681     0.5842  -2.684  0.00727 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2141.4  on 5985  degrees of freedom
## Residual deviance: 2042.7  on 5974  degrees of freedom
## AIC: 2842.2
## 
## Number of Fisher Scoring iterations: 6
poisson_reg3 <- glm(formulas_3, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg3)
## 
## Call:
## glm(formula = formulas_3, family = poisson(link = log), data = df_train, 
##     offset = log(Exp_weights))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8912  -0.4229  -0.3086  -0.1989   3.7722  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.9951   298.1363  -0.040  0.96791    
## SexInsuredM   0.2319     0.1795   1.292  0.19643    
## SexInsuredU  10.4117   298.1363   0.035  0.97214    
## NCDCat10     -0.3815     0.1444  -2.642  0.00823 ** 
## NCDCat20     -0.4346     0.1476  -2.945  0.00323 ** 
## NCDCat30     -0.3513     0.2182  -1.610  0.10738    
## NCDCat40     -0.6997     0.2749  -2.545  0.01093 *  
## NCDCat50     -0.6505     0.1612  -4.035 5.45e-05 ***
## AgeCat2      10.1881   298.1364   0.034  0.97274    
## AgeCat3      10.2931   298.1363   0.035  0.97246    
## AgeCat4      10.1762   298.1363   0.034  0.97277    
## AgeCat5       9.9213   298.1363   0.033  0.97345    
## AgeCat6      10.6297   298.1364   0.036  0.97156    
## AgeCat7      10.9335   298.1371   0.037  0.97075    
## VAgeCat1      0.2344     0.1638   1.431  0.15251    
## VAgeCat2      0.3470     0.1596   2.174  0.02968 *  
## VAgeCat3      0.1254     0.2223   0.564  0.57257    
## VAgeCat4     -0.4589     0.2375  -1.932  0.05332 .  
## VAgeCat5     -1.2343     0.2749  -4.490 7.13e-06 ***
## VAgeCat6     -1.5824     0.5954  -2.657  0.00787 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2141.4  on 5985  degrees of freedom
## Residual deviance: 2035.0  on 5966  degrees of freedom
## AIC: 2850.4
## 
## Number of Fisher Scoring iterations: 11
# anova() compares nested models, so it is not used here to compare
# poisson_reg1 and poisson_reg2 (which are not nested)
anova(poisson_reg1,poisson_reg3,test="Chisq") 
## Analysis of Deviance Table
## 
## Model 1: Clm_Count ~ SexInsured
## Model 2: Clm_Count ~ SexInsured + NCDCat + AgeCat + VAgeCat
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1      5983     2130.3                         
## 2      5966     2035.0 17   95.327 6.48e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
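For the non-nested comparisons that anova() cannot handle (such as poisson_reg1 vs poisson_reg2), AIC offers a simple alternative; a minimal sketch:

```r
# lower AIC indicates a better trade-off between fit and complexity
AIC(poisson_reg1, poisson_reg2, poisson_reg3)
```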

e) Model Prediction

final_model<-poisson_reg3

df_train$predict<-exp(predict(final_model))/df_train$Exp_weights

mean(df_train$predict)
## [1] 0.1317536
mean(df_train$Frequency)
## [1] 0.1293873
sum(df_train$Clm_Count)/sum(df_train$Exp_weights)
## [1] 0.1319254
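An equivalent, arguably clearer, way to obtain the same fitted frequencies is to predict on the response scale (which already incorporates the offset) and divide by exposure; a sketch assuming the objects above:

```r
# predict(type = "response") returns expected claim counts including the offset
pred_response <- predict(final_model, type = "response") / df_train$Exp_weights
all.equal(as.numeric(pred_response), as.numeric(df_train$predict))
```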

f) Apply the model to hold-out data and test

final_model<-poisson_reg3

df_validate$predict<-exp(predict(final_model,newdata=df_validate))/df_validate$Exp_weights

mean(df_validate$predict)
## [1] 0.1305999
mean(df_validate$Frequency)
## [1] 0.1636985
sum(df_validate$Clm_Count)/sum(df_validate$Exp_weights)
## [1] 0.1443286
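As a rough goodness-of-fit check on the Poisson assumption (not part of the original output), the Pearson dispersion statistic can be computed; values well above 1 suggest overdispersion:

```r
# Pearson chi-square statistic divided by residual degrees of freedom
sum(residuals(final_model, type = "pearson")^2) / df.residual(final_model)
```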

Step 5: Output Dataset

#Output to Excel or CSV

#write.csv(check,file="P:/YT_R/HandsOn/output.csv")

write.csv(df_validate,file="/Users/yuantian2019/Downloads/output.csv")
# alternatively, write_csv() from the readr package, which is faster than write.csv()
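A minimal readr sketch, assuming the file is written to the working directory:

```r
library(readr)
# write_csv() is faster than write.csv() and never writes row names
write_csv(df_validate, "output.csv")
```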

End of the document